fix: deduplicate ZipStore central directory on close #3858

nuglifeleoji wants to merge 5 commits into zarr-developers:main
Conversation
ZIP files are append-only: writing the same key (e.g. `zarr.json`) a second time adds a new entry without removing the old one. When an array is resized, `zarr.json` is rewritten, which accumulates duplicates in the central directory and triggers `UserWarning: Duplicate name`. On `close()`, `_dedup_central_directory()` now scans `filelist` in reverse and keeps only the last (most recent) entry for each filename, so the on-disk central directory is clean.

Closes zarr-developers#3580

Made-with: Cursor
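The reverse-scan described above can be sketched roughly like this (a hypothetical standalone version over a plain `zipfile.ZipFile`, not the exact PR code; `filelist` is an internal CPython attribute):

```python
import io
import warnings
import zipfile


def dedup_central_directory(zf: zipfile.ZipFile) -> None:
    """Keep only the newest entry per filename (sketch of the PR's idea)."""
    seen: set[str] = set()
    deduped: list[zipfile.ZipInfo] = []
    # Walk newest-first: the first occurrence of a name is the latest write.
    for info in reversed(zf.infolist()):
        if info.filename not in seen:
            seen.add(info.filename)
            deduped.append(info)
    # Restore original order before close() writes the central directory.
    zf.filelist = list(reversed(deduped))


buf = io.BytesIO()
zf = zipfile.ZipFile(buf, "w")
zf.writestr("zarr.json", "old")
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # zipfile's "Duplicate name" UserWarning
    zf.writestr("zarr.json", "new")  # append-only: the old entry stays in the file
dedup_central_directory(zf)
zf.close()  # the central directory now lists each name exactly once
```

Reopening the archive afterwards shows a single `zarr.json` entry resolving to the newest data.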
What happens if that scan is interrupted?
Good point — the original code updated
Python's `zipfile` warns when the same filename is written twice. ZipStore is append-only by design; duplicates are cleaned up by `_dedup_central_directory()` on close. Suppress the warning at the source so test suites with `filterwarnings=error` don't treat it as a failure.

Made-with: Cursor
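That kind of targeted suppression can look like this (a minimal sketch, not the PR's exact code; the message text matches what CPython's `zipfile` emits):

```python
import io
import warnings
import zipfile

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("zarr.json", "v1")
    # Rewriting a key appends a second entry; silence only the
    # duplicate-name warning so filterwarnings=error suites stay green.
    with warnings.catch_warnings():
        warnings.filterwarnings(
            "ignore", message="Duplicate name", category=UserWarning
        )
        zf.writestr("zarr.json", "v2")
```

Both entries remain in the archive here; this only silences the write-time warning, leaving dedup to the close path.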
The CI failures are caused by tests that expected the 'Duplicate name' UserWarning to still be emitted. Since this PR deduplicates entries on close, the warning is no longer raised and those tests fail with 'DID NOT WARN'. Two options:
Which approach would you prefer? Happy to update accordingly. |
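For context, stock `zipfile` does warn on the duplicate write, which is the behavior those tests asserted; once ZipStore suppresses or removes the duplicate, a `pytest.warns(UserWarning, match="Duplicate name")` check has nothing to catch. A sketch of the stock behavior the tests relied on:

```python
import io
import warnings
import zipfile

buf = io.BytesIO()
zf = zipfile.ZipFile(buf, "w")
zf.writestr("zarr.json", "v1")
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    zf.writestr("zarr.json", "v2")  # plain zipfile warns here
zf.close()

assert any("Duplicate name" in str(w.message) for w in caught)
```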
mkitti
left a comment
`NameToInfo` already has a unique set of filenames as its keys and the latest `ZipInfo` entries as its values, I believe. Please double check this.

If so, we could just use `NameToInfo` to update `filelist`. By using `zipfile.getinfo` to retrieve information from `NameToInfo`, we could implement everything with the public API.
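A quick check of that claim against CPython's `zipfile` (the assumptions being that `getinfo()` reads `NameToInfo` and that later writes overwrite earlier entries in it):

```python
import io
import warnings
import zipfile

buf = io.BytesIO()
zf = zipfile.ZipFile(buf, "w")
zf.writestr("a.txt", "old")
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # duplicate-name UserWarning
    zf.writestr("a.txt", "new")

# Both appended entries are still listed...
assert len(zf.infolist()) == 2
# ...but getinfo()/NameToInfo resolves to the latest one.
assert zf.getinfo("a.txt") is zf.infolist()[-1]
zf.close()
```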
src/zarr/storage/_zip.py
Outdated
```python
    return
seen: set[str] = set()
deduped: list[zipfile.ZipInfo] = []
for info in reversed(self._zf.filelist):
```
Suggested change:

```diff
-for info in reversed(self._zf.filelist):
+for info in reversed(self._zf.infolist()):
```
Rather than accessing filelist directly here, use the public API infolist().
src/zarr/storage/_zip.py
Outdated
```python
new_filelist = list(reversed(deduped))
new_name_to_info = {info.filename: info for info in new_filelist}
# Swap both attributes together; if anything above raised, the
# original filelist/NameToInfo are still intact.
self._zf.filelist = new_filelist
self._zf.NameToInfo = new_name_to_info
```
Looking at `NameToInfo` I have a few questions and observations:

- Rather than constructing `seen`, a `set`, and `deduped`, a `list`, would it not be easier to just construct a `dict`?
- That `dict` seems equivalent to `NameToInfo`.
- The original `NameToInfo` seems to have what we want already. It contains a unique set of filenames as its keys and the latest version of each file as its values (double check this).
- Perhaps all that we need to do is set the contents of `filelist` to the values of `NameToInfo`?

```python
self._zf.filelist = list(self._zf.NameToInfo.values())
```

There might even be a way to do this completely with the public API.
```python
# zipfile.infolist() is public API
_filelist = self._zf.infolist()
# Make sure that the public API returns the SAME Python object as
# zipfile.filelist, which is NOT public API
assert _filelist is self._zf.filelist, (
    "The assumption that zipfile.infolist() returns zipfile.filelist is not valid"
)
# zipfile.namelist() is public API
_namelist = np.unique(self._zf.namelist())
# Manipulate the list in place
_filelist.clear()
# zipfile.getinfo is just NameToInfo.get
_filelist.extend({self._zf.getinfo(name) for name in _namelist})
```
Thanks for the review, @mkitti!
I checked the CPython `zipfile` source: `NameToInfo` is updated via dict assignment (`NameToInfo[name] = zinfo`), so later writes always overwrite earlier ones. This means `NameToInfo` already holds the latest `ZipInfo` for each filename. I also confirmed that `infolist()` returns `self.filelist` directly (not a copy), and `getinfo(name)` is just `NameToInfo[name]`.
What I changed:
Since `NameToInfo` is already correct, only `filelist` needs to be rebuilt. I replaced the seen/deduped loop with:

```python
unique_names = dict.fromkeys(self._zf.namelist())  # public API, preserves order, deduplicates
self._zf.filelist = [self._zf.getinfo(name) for name in unique_names]  # public API
```
NameToInfo no longer needs to be updated at all. The new list is computed before assignment, so atomicity is preserved.
One note on your public-API snippet:

The set comprehension `{self._zf.getinfo(name) for name in _namelist}` loses insertion order, and `np.unique()` sorts names alphabetically — both could reorder the entries in `filelist`. I used `dict.fromkeys()` instead to deduplicate while preserving first-seen order, with no numpy dependency needed.
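A small illustration of the ordering point (plain Python, no zip file involved):

```python
names = ["zarr.json", "0.0", "zarr.json", "0.1"]

# dict.fromkeys() deduplicates while keeping first-seen order.
assert list(dict.fromkeys(names)) == ["zarr.json", "0.0", "0.1"]

# sorted(set(...)) -- like np.unique() -- reorders alphabetically instead.
assert sorted(set(names)) == ["0.0", "0.1", "zarr.json"]
```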
Thanks so much for your review; I'm really grateful and eager to tackle other problems too.
NameToInfo is already a {filename -> latest ZipInfo} mapping because
each writestr/write call does NameToInfo[name] = zinfo, so later writes
overwrite earlier ones (confirmed via CPython zipfile source).
Replace the manual seen/deduped loop with a two-liner that uses public API:
- namelist() to get all filenames (preserving insertion order)
- dict.fromkeys() to deduplicate while keeping first-seen order
- getinfo() to retrieve the latest ZipInfo for each unique name
Only filelist needs to be updated; NameToInfo is already correct and
does not need to be rewritten. The new list is computed before assignment
so the ZipFile is never left in an inconsistent state on exception.
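Put together, the commit's approach reduces to something like this sketch (written here as a free function over a `ZipFile`, rather than the actual ZipStore method):

```python
import io
import warnings
import zipfile


def dedup(zf: zipfile.ZipFile) -> None:
    # namelist() preserves insertion order; dict.fromkeys() drops
    # duplicates while keeping the first occurrence of each name.
    unique_names = dict.fromkeys(zf.namelist())
    # getinfo() reads NameToInfo, which already maps each name to its
    # latest ZipInfo; the new list is built before assignment, so an
    # exception leaves the ZipFile untouched.
    zf.filelist = [zf.getinfo(name) for name in unique_names]


buf = io.BytesIO()
zf = zipfile.ZipFile(buf, "w")
zf.writestr("zarr.json", "old")
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # duplicate-name UserWarning
    zf.writestr("zarr.json", "new")
dedup(zf)
zf.close()  # central directory written with one entry per name
```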
Made-with: Cursor
Force-pushed from 406e49d to eebe867
ZIP files are append-only: writing the same key a second time adds a new entry without removing the old one. When an array is resized, `zarr.json` is rewritten, which accumulates duplicate filenames in the central directory and produces `UserWarning: Duplicate name`. Many zip readers (e.g. KDE Ark, Windows Explorer) show only the first entry for a duplicated name, making the stored data appear stale or corrupt. The duplicates also waste space.

Fix: `ZipStore.close()` now calls `_dedup_central_directory()` before closing the file. It scans `filelist` in reverse and keeps only the last (most recent) entry for each filename, so the on-disk central directory is clean. This follows the approach suggested in the issue by @mkitti.

Closes #3580
Made with Cursor